Customer Segmentation and Market Basket Analysis#
Introduction:
This notebook we using publicly dataset On-line Retail to explore customer segmentation through the interesting task of unsupervised learning method. Then we go further and apply association rule mining approach to find interesting rules and patterns in this transaction database. These customer segmentation, rules and patterns can be used to make interesting and useful decisions as far as user interest is concerned.
The Online Retail a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.
But first, let’s see some deep understand of both concepts:
Customer segmentation: Customer segmentation is the problem of uncovering information about a firm’s customer base, based on their interactions with the business. In most cases this interaction is in terms of their purchase behavior and patterns. We explore some of the ways in which this can be used.
Market basket analysis: Market basket analysis is a method to gain insights into granular behavior of customers. This is helpful in devising strategies which uncovers deeper understanding of purchase decisions taken by the customers. This is interesting as a lot of times even the customer will be unaware of such biases or trends in their purchasing behavior.
Table of Contents
- 1 Load Dependencies and Configuration Settings
- 2 Load Dataset
- 3 Exploratory Data Analysis (EDA)
- 4 Customer Segmentation:
- 5 Cross Selling
Load Dependencies and Configuration Settings#
We started with the installation of the orange3 package through the command line, since it is not possible to include it through the usual procedure of adding custom packages in the Kernel.
pip install orange3
^C
Note: you may need to restart the kernel to use updated packages.
pip install pandasql
Requirement already satisfied: pandasql in /usr/local/python/3.10.13/lib/python3.10/site-packages (0.7.3)
Requirement already satisfied: numpy in /home/codespace/.local/lib/python3.10/site-packages (from pandasql) (1.26.4)
Requirement already satisfied: pandas in /home/codespace/.local/lib/python3.10/site-packages (from pandasql) (2.2.1)
Requirement already satisfied: sqlalchemy in /usr/local/python/3.10.13/lib/python3.10/site-packages (from pandasql) (2.0.29)
Requirement already satisfied: python-dateutil>=2.8.2 in /home/codespace/.local/lib/python3.10/site-packages (from pandas->pandasql) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /home/codespace/.local/lib/python3.10/site-packages (from pandas->pandasql) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /home/codespace/.local/lib/python3.10/site-packages (from pandas->pandasql) (2024.1)
Requirement already satisfied: typing-extensions>=4.6.0 in /home/codespace/.local/lib/python3.10/site-packages (from sqlalchemy->pandasql) (4.10.0)
Requirement already satisfied: greenlet!=0.4.17 in /usr/local/python/3.10.13/lib/python3.10/site-packages (from sqlalchemy->pandasql) (3.0.3)
Requirement already satisfied: six>=1.5 in /home/codespace/.local/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas->pandasql) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
import os
import warnings
warnings.simplefilter(action = 'ignore', category=FutureWarning)
warnings.filterwarnings('ignore')
def ignore_warn(*args, **kwargs):
pass
warnings.warn = ignore_warn #ignore annoying warning (from sklearn and seaborn)
import pandas as pd
import datetime
import math
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import matplotlib.cm as cm
%matplotlib inline
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
import seaborn as sns
sns.set(style="ticks", color_codes=True, font_scale=1.5)
color = sns.color_palette()
sns.set_style('darkgrid')
from mpl_toolkits.mplot3d import Axes3D
import plotly as py
import plotly.graph_objs as go
py.offline.init_notebook_mode()
from scipy import stats
from scipy.stats import skew, norm, probplot, boxcox
from sklearn import preprocessing
import math
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import Orange
from Orange.data import Domain, DiscreteVariable, ContinuousVariable
from orangecontrib.associate.fpgrowth import *
cs_df = pd.read_excel(io=r'/workspaces/tugas/OnlineRetail.xlsx')
Exploratory Data Analysis (EDA)#

Take a First Look of our Data:#
I created the function below to simplify the analysis of general characteristics of the data. Inspired on the str function of R, this function returns the types, counts, distinct, count nulls, missing ratio and uniques values of each field/feature.
If the study involve some supervised learning, this function can return the study of the correlation, for this we just need provide the dependent variable to the pred parameter.
Also, if its return is stored in a variable you can evaluate it in more detail, focus on specific field, or sort them from different perspectives.
import pandas as pd
def rstr(df, pred=None):
obs = df.shape[0]
types = df.dtypes
counts = df.apply(lambda x: x.count())
uniques = df.apply(lambda x: x.nunique())
nulls = df.isnull().sum()
distincts = df.apply(lambda x: x.unique().shape[0])
missing_ratio = (df.isnull().sum() / obs) * 100
# Calculate skewness and kurtosis only for numeric columns
numeric_cols = df.select_dtypes(include=['number']).columns
skewness = df[numeric_cols].skew()
kurtosis = df[numeric_cols].kurt()
print('Data shape:', df.shape)
if pred is None:
cols = ['types', 'counts', 'distincts', 'nulls', 'missing ratio', 'uniques', 'skewness', 'kurtosis']
str_summary = pd.concat([types, counts, distincts, nulls, missing_ratio, uniques, skewness, kurtosis], axis=1, sort=True)
else:
corr = df.corr()[pred]
str_summary = pd.concat([types, counts, distincts, nulls, missing_ratio, uniques, skewness, kurtosis, corr], axis=1, sort=True)
corr_col = 'corr ' + pred
cols = ['types', 'counts', 'distincts', 'nulls', 'missing ratio', 'uniques', 'skewness', 'kurtosis', corr_col]
str_summary.columns = cols
dtypes = str_summary.types.value_counts()
print('_________\nData types:\n', dtypes)
print('_________')
return str_summary
details = rstr(cs_df)
display(details.sort_values(by='missing ratio', ascending=False))
Data shape: (541909, 8)
_________
Data types:
types
object 4
float64 2
datetime64[ns] 1
int64 1
Name: count, dtype: int64
_________
| types | counts | distincts | nulls | missing ratio | uniques | skewness | kurtosis | |
|---|---|---|---|---|---|---|---|---|
| CustomerID | float64 | 406829 | 4373 | 135080 | 24.926694 | 4372 | 0.029835 | -1.179982 |
| Description | object | 540455 | 4224 | 1454 | 0.268311 | 4223 | NaN | NaN |
| Country | object | 541909 | 38 | 0 | 0.000000 | 38 | NaN | NaN |
| InvoiceDate | datetime64[ns] | 541909 | 23260 | 0 | 0.000000 | 23260 | NaN | NaN |
| InvoiceNo | object | 541909 | 25900 | 0 | 0.000000 | 25900 | NaN | NaN |
| Quantity | int64 | 541909 | 722 | 0 | 0.000000 | 722 | -0.264076 | 119769.160031 |
| StockCode | object | 541909 | 4070 | 0 | 0.000000 | 4070 | NaN | NaN |
| UnitPrice | float64 | 541909 | 1630 | 0 | 0.000000 | 1630 | 186.506972 | 59005.719097 |
Let’s see the description of each column:
InvoiceNo: A unique identifier for the invoice. An invoice number shared across rows means that those transactions were performed in a single invoice (multiple purchases).
StockCode: Identifier for items contained in an invoice.
Description: Textual description of each of the stock item.
Quantity: The quantity of the item purchased.
InvoiceDate: Date of purchase.
UnitPrice: Value of each item.
CustomerID: Identifier for customer making the purchase.
Country: Country of customer.
cs_df.describe()
| Quantity | InvoiceDate | UnitPrice | CustomerID | |
|---|---|---|---|---|
| count | 541909.000000 | 541909 | 541909.000000 | 406829.000000 |
| mean | 9.552250 | 2011-07-04 13:34:57.156386048 | 4.611114 | 15287.690570 |
| min | -80995.000000 | 2010-12-01 08:26:00 | -11062.060000 | 12346.000000 |
| 25% | 1.000000 | 2011-03-28 11:34:00 | 1.250000 | 13953.000000 |
| 50% | 3.000000 | 2011-07-19 17:17:00 | 2.080000 | 15152.000000 |
| 75% | 10.000000 | 2011-10-19 11:27:00 | 4.130000 | 16791.000000 |
| max | 80995.000000 | 2011-12-09 12:50:00 | 38970.000000 | 18287.000000 |
| std | 218.081158 | NaN | 96.759853 | 1713.600303 |
We can observe from the preceding output that Quantity and UnitPrice are having negative values, which may mean that we may have some return transactions in our data also. As our goal is customer segmentation and market basket analysis, it is important that these records are removed, but first we will take a look at whether there are records where both are negative or if one of them is negative and the other is zero.
print('Check if we had negative quantity and prices at same register:',
'No' if cs_df[(cs_df.Quantity<0) & (cs_df.UnitPrice<0)].shape[0] == 0 else 'Yes', '\n')
print('Check how many register we have where quantity is negative',
'and prices is 0 or vice-versa:',
cs_df[(cs_df.Quantity<=0) & (cs_df.UnitPrice<=0)].shape[0])
print('\nWhat is the customer ID of the registers above:',
cs_df.loc[(cs_df.Quantity<=0) & (cs_df.UnitPrice<=0),
['CustomerID']].CustomerID.unique())
print('\n% Negative Quantity: {:3.2%}'.format(cs_df[(cs_df.Quantity<0)].shape[0]/cs_df.shape[0]))
print('\nAll register with negative quantity has Invoice start with:',
cs_df.loc[(cs_df.Quantity<0) & ~(cs_df.CustomerID.isnull()), 'InvoiceNo'].apply(lambda x: x[0]).unique())
print('\nSee an example of negative quantity and others related records:')
display(cs_df[(cs_df.CustomerID==12472) & (cs_df.StockCode==22244)])
Check if we had negative quantity and prices at same register: No
Check how many register we have where quantity is negative and prices is 0 or vice-versa: 1336
What is the customer ID of the registers above: [nan]
% Negative Quantity: 1.96%
All register with negative quantity has Invoice start with: ['C']
See an example of negative quantity and others related records:
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
|---|---|---|---|---|---|---|---|---|
| 1973 | C536548 | 22244 | 3 HOOK HANGER MAGIC GARDEN | -4 | 2010-12-01 14:33:00 | 1.95 | 12472.0 | Germany |
| 9438 | 537201 | 22244 | 3 HOOK HANGER MAGIC GARDEN | 12 | 2010-12-05 14:19:00 | 1.95 | 12472.0 | Germany |
| 121980 | 546843 | 22244 | 3 HOOK HANGER MAGIC GARDEN | 12 | 2011-03-17 12:40:00 | 1.95 | 12472.0 | Germany |
print('Check register with UnitPrice negative:')
display(cs_df[(cs_df.UnitPrice<0)])
print("Sales records with Customer ID and zero in Unit Price:",cs_df[(cs_df.UnitPrice==0) & ~(cs_df.CustomerID.isnull())].shape[0])
cs_df[(cs_df.UnitPrice==0) & ~(cs_df.CustomerID.isnull())]
Check register with UnitPrice negative:
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
|---|---|---|---|---|---|---|---|---|
| 299983 | A563186 | B | Adjust bad debt | 1 | 2011-08-12 14:51:00 | -11062.06 | NaN | United Kingdom |
| 299984 | A563187 | B | Adjust bad debt | 1 | 2011-08-12 14:52:00 | -11062.06 | NaN | United Kingdom |
Sales records with Customer ID and zero in Unit Price: 40
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
|---|---|---|---|---|---|---|---|---|
| 9302 | 537197 | 22841 | ROUND CAKE TIN VINTAGE GREEN | 1 | 2010-12-05 14:02:00 | 0.0 | 12647.0 | Germany |
| 33576 | 539263 | 22580 | ADVENT CALENDAR GINGHAM SACK | 4 | 2010-12-16 14:36:00 | 0.0 | 16560.0 | United Kingdom |
| 40089 | 539722 | 22423 | REGENCY CAKESTAND 3 TIER | 10 | 2010-12-21 13:45:00 | 0.0 | 14911.0 | EIRE |
| 47068 | 540372 | 22090 | PAPER BUNTING RETROSPOT | 24 | 2011-01-06 16:41:00 | 0.0 | 13081.0 | United Kingdom |
| 47070 | 540372 | 22553 | PLASTERS IN TIN SKULLS | 24 | 2011-01-06 16:41:00 | 0.0 | 13081.0 | United Kingdom |
| 56674 | 541109 | 22168 | ORGANISER WOOD ANTIQUE WHITE | 1 | 2011-01-13 15:10:00 | 0.0 | 15107.0 | United Kingdom |
| 86789 | 543599 | 84535B | FAIRY CAKES NOTEBOOK A6 SIZE | 16 | 2011-02-10 13:08:00 | 0.0 | 17560.0 | United Kingdom |
| 130188 | 547417 | 22062 | CERAMIC BOWL WITH LOVE HEART DESIGN | 36 | 2011-03-23 10:25:00 | 0.0 | 13239.0 | United Kingdom |
| 139453 | 548318 | 22055 | MINI CAKE STAND HANGING STRAWBERY | 5 | 2011-03-30 12:45:00 | 0.0 | 13113.0 | United Kingdom |
| 145208 | 548871 | 22162 | HEART GARLAND RUSTIC PADDED | 2 | 2011-04-04 14:42:00 | 0.0 | 14410.0 | United Kingdom |
| 157042 | 550188 | 22636 | CHILDS BREAKFAST SET CIRCUS PARADE | 1 | 2011-04-14 18:57:00 | 0.0 | 12457.0 | Switzerland |
| 187613 | 553000 | 47566 | PARTY BUNTING | 4 | 2011-05-12 15:21:00 | 0.0 | 17667.0 | United Kingdom |
| 198383 | 554037 | 22619 | SET OF 6 SOLDIER SKITTLES | 80 | 2011-05-20 14:13:00 | 0.0 | 12415.0 | Australia |
| 279324 | 561284 | 22167 | OVAL WALL MIRROR DIAMANTE | 1 | 2011-07-26 12:24:00 | 0.0 | 16818.0 | United Kingdom |
| 282912 | 561669 | 22960 | JAM MAKING SET WITH JARS | 11 | 2011-07-28 17:09:00 | 0.0 | 12507.0 | Spain |
| 285657 | 561916 | M | Manual | 1 | 2011-08-01 11:44:00 | 0.0 | 15581.0 | United Kingdom |
| 298054 | 562973 | 23157 | SET OF 6 NATIVITY MAGNETS | 240 | 2011-08-11 11:42:00 | 0.0 | 14911.0 | EIRE |
| 314745 | 564651 | 23270 | SET OF 2 CERAMIC PAINTED HEARTS | 96 | 2011-08-26 14:19:00 | 0.0 | 14646.0 | Netherlands |
| 314746 | 564651 | 23268 | SET OF 2 CERAMIC CHRISTMAS REINDEER | 192 | 2011-08-26 14:19:00 | 0.0 | 14646.0 | Netherlands |
| 314747 | 564651 | 22955 | 36 FOIL STAR CAKE CASES | 144 | 2011-08-26 14:19:00 | 0.0 | 14646.0 | Netherlands |
| 314748 | 564651 | 21786 | POLKADOT RAIN HAT | 144 | 2011-08-26 14:19:00 | 0.0 | 14646.0 | Netherlands |
| 358655 | 568158 | PADS | PADS TO MATCH ALL CUSHIONS | 1 | 2011-09-25 12:22:00 | 0.0 | 16133.0 | United Kingdom |
| 361825 | 568384 | M | Manual | 1 | 2011-09-27 09:46:00 | 0.0 | 12748.0 | United Kingdom |
| 379913 | 569716 | 22778 | GLASS CLOCHE SMALL | 2 | 2011-10-06 08:17:00 | 0.0 | 15804.0 | United Kingdom |
| 395529 | 571035 | M | Manual | 1 | 2011-10-13 12:50:00 | 0.0 | 12446.0 | RSA |
| 420404 | 572893 | 21208 | PASTEL COLOUR HONEYCOMB FAN | 5 | 2011-10-26 14:36:00 | 0.0 | 18059.0 | United Kingdom |
| 436428 | 574138 | 23234 | BISCUIT TIN VINTAGE CHRISTMAS | 216 | 2011-11-03 11:26:00 | 0.0 | 12415.0 | Australia |
| 436597 | 574175 | 22065 | CHRISTMAS PUDDING TRINKET POT | 12 | 2011-11-03 11:47:00 | 0.0 | 14110.0 | United Kingdom |
| 436961 | 574252 | M | Manual | 1 | 2011-11-03 13:24:00 | 0.0 | 12437.0 | France |
| 439361 | 574469 | 22385 | JUMBO BAG SPACEBOY DESIGN | 12 | 2011-11-04 11:55:00 | 0.0 | 12431.0 | Australia |
| 446125 | 574879 | 22625 | RED KITCHEN SCALES | 2 | 2011-11-07 13:22:00 | 0.0 | 13014.0 | United Kingdom |
| 446793 | 574920 | 22899 | CHILDREN'S APRON DOLLY GIRL | 1 | 2011-11-07 16:34:00 | 0.0 | 13985.0 | United Kingdom |
| 446794 | 574920 | 23480 | MINI LIGHTS WOODLAND MUSHROOMS | 1 | 2011-11-07 16:34:00 | 0.0 | 13985.0 | United Kingdom |
| 454463 | 575579 | 22437 | SET OF 9 BLACK SKULL BALLOONS | 20 | 2011-11-10 11:49:00 | 0.0 | 13081.0 | United Kingdom |
| 454464 | 575579 | 22089 | PAPER BUNTING VINTAGE PAISLEY | 24 | 2011-11-10 11:49:00 | 0.0 | 13081.0 | United Kingdom |
| 479079 | 577129 | 22464 | HANGING METAL HEART LANTERN | 4 | 2011-11-17 19:52:00 | 0.0 | 15602.0 | United Kingdom |
| 479546 | 577168 | M | Manual | 1 | 2011-11-18 10:42:00 | 0.0 | 12603.0 | Germany |
| 480649 | 577314 | 23407 | SET OF 2 TRAYS HOME SWEET HOME | 2 | 2011-11-18 13:23:00 | 0.0 | 12444.0 | Norway |
| 485985 | 577696 | M | Manual | 1 | 2011-11-21 11:57:00 | 0.0 | 16406.0 | United Kingdom |
| 502122 | 578841 | 84826 | ASSTD DESIGN 3D PAPER STICKERS | 12540 | 2011-11-25 15:57:00 | 0.0 | 13256.0 | United Kingdom |
As you can see, there are no records where quantity and price are negative, but there are 1.336 records where one of them is and the other is 0. However, note that for all these records we do not have the customer ID. So we conclude that we can erase all records in that quantity or the price and negative. In addition, by the foregoing summary we see that there are 135,080 records without customer identification that we may also disregard.
# Remove rows with missing CustomerID
cs_df = cs_df.dropna(subset=['CustomerID'])
# Remove negative or return transactions
cs_df = cs_df[(cs_df['Quantity'] >= 0) & (cs_df['UnitPrice'] > 0)]
details = rstr(cs_df)
display(details.sort_values(by='distincts', ascending=False))
Data shape: (397884, 8)
_________
Data types:
types
object 4
float64 2
datetime64[ns] 1
int64 1
Name: count, dtype: int64
_________
| types | counts | distincts | nulls | missing ratio | uniques | skewness | kurtosis | |
|---|---|---|---|---|---|---|---|---|
| InvoiceNo | object | 397884 | 18532 | 0 | 0.0 | 18532 | NaN | NaN |
| InvoiceDate | datetime64[ns] | 397884 | 17282 | 0 | 0.0 | 17282 | NaN | NaN |
| CustomerID | float64 | 397884 | 4338 | 0 | 0.0 | 4338 | 0.025729 | -1.180822 |
| Description | object | 397884 | 3877 | 0 | 0.0 | 3877 | NaN | NaN |
| StockCode | object | 397884 | 3665 | 0 | 0.0 | 3665 | NaN | NaN |
| UnitPrice | float64 | 397884 | 440 | 0 | 0.0 | 440 | 204.032727 | 58140.396673 |
| Quantity | int64 | 397884 | 301 | 0 | 0.0 | 301 | 409.892972 | 178186.243253 |
| Country | object | 397884 | 37 | 0 | 0.0 | 37 | NaN | NaN |
After this first cleanup, note that we still have more description than inventory codes, so we still have some inconsistency on the basis that requires further investigation. Let´s see it:
cat_des_df = cs_df.groupby(["StockCode","Description"]).count().reset_index()
display(cat_des_df.StockCode.value_counts()[cat_des_df.StockCode.value_counts()>1].reset_index().head())
cs_df[cs_df['StockCode'] == cat_des_df.StockCode.value_counts()[cat_des_df.StockCode.value_counts()>1]
.reset_index()['index'][4]]['Description'].unique()
| StockCode | count | |
|---|---|---|
| 0 | 23196 | 4 |
| 1 | 23236 | 4 |
| 2 | 23203 | 3 |
| 3 | 17107D | 3 |
| 4 | 23535 | 3 |
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
3804 try:
-> 3805 return self._engine.get_loc(casted_key)
3806 except KeyError as err:
File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()
File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'index'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[10], line 3
1 cat_des_df = cs_df.groupby(["StockCode","Description"]).count().reset_index()
2 display(cat_des_df.StockCode.value_counts()[cat_des_df.StockCode.value_counts()>1].reset_index().head())
----> 3 cs_df[cs_df['StockCode'] == cat_des_df.StockCode.value_counts()[cat_des_df.StockCode.value_counts()>1]
4 .reset_index()['index'][4]]['Description'].unique()
File ~/.local/lib/python3.10/site-packages/pandas/core/frame.py:4090, in DataFrame.__getitem__(self, key)
4088 if self.columns.nlevels > 1:
4089 return self._getitem_multilevel(key)
-> 4090 indexer = self.columns.get_loc(key)
4091 if is_integer(indexer):
4092 indexer = [indexer]
File ~/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
3807 if isinstance(casted_key, slice) or (
3808 isinstance(casted_key, abc.Iterable)
3809 and any(isinstance(x, slice) for x in casted_key)
3810 ):
3811 raise InvalidIndexError(key)
-> 3812 raise KeyError(key) from err
3813 except TypeError:
3814 # If we have a listlike key, _check_indexing_error will raise
3815 # InvalidIndexError. Otherwise we fall through and re-raise
3816 # the TypeError.
3817 self._check_indexing_error(key)
KeyError: 'index'
This gives the multiple descriptions for one of those items and we witness the simple ways in which data quality can be corrupted in any dataset. A simple spelling mistake can end up in reducing data quality and an erroneous analysis.
unique_desc = cs_df[["StockCode", "Description"]].groupby(by=["StockCode"]).\
apply(pd.DataFrame.mode).reset_index(drop=True)
q = '''
select df.InvoiceNo, df.StockCode, un.Description, df.Quantity, df.InvoiceDate,
df.UnitPrice, df.CustomerID, df.Country
from cs_df as df INNER JOIN
unique_desc as un on df.StockCode = un.StockCode
'''
cs_df = pysqldf(q)
cs_df.InvoiceDate = pd.to_datetime(cs_df.InvoiceDate)
cs_df['amount'] = cs_df.Quantity*cs_df.UnitPrice
cs_df.CustomerID = cs_df.CustomerID.astype('Int64')
details = rstr(cs_df)
display(details.sort_values(by='distincts', ascending=False))
Data shape: (397884, 9)
_________
Data types:
types
object 3
int64 2
float64 2
Int64 1
datetime64[ns] 1
Name: count, dtype: int64
_________
| types | counts | distincts | nulls | missing ratio | uniques | skewness | kurtosis | |
|---|---|---|---|---|---|---|---|---|
| InvoiceNo | int64 | 397884 | 18532 | 0 | 0.0 | 18532 | -0.178524 | -1.200748 |
| InvoiceDate | datetime64[ns] | 397884 | 17282 | 0 | 0.0 | 17282 | <NA> | <NA> |
| CustomerID | Int64 | 397884 | 4338 | 0 | 0.0 | 4338 | 0.025729 | -1.180822 |
| StockCode | object | 397884 | 3665 | 0 | 0.0 | 3665 | <NA> | <NA> |
| Description | object | 397884 | 3647 | 0 | 0.0 | 3647 | <NA> | <NA> |
| amount | float64 | 397884 | 2939 | 0 | 0.0 | 2939 | 451.443182 | 232155.117219 |
| UnitPrice | float64 | 397884 | 440 | 0 | 0.0 | 440 | 204.032727 | 58140.396673 |
| Quantity | int64 | 397884 | 301 | 0 | 0.0 | 301 | 409.892972 | 178186.243253 |
| Country | object | 397884 | 37 | 0 | 0.0 | 37 | <NA> | <NA> |
fig = plt.figure(figsize=(25, 7))
f1 = fig.add_subplot(121)
g = cs_df.groupby(["Country"]).amount.sum().sort_values(ascending = False).plot(kind='bar', title='Amount Sales by Country')
cs_df['Internal'] = cs_df.Country.apply(lambda x: 'Yes' if x=='United Kingdom' else 'No' )
f2 = fig.add_subplot(122)
market = cs_df.groupby(["Internal"]).amount.sum().sort_values(ascending = False)
g = plt.pie(market, labels=market.index, autopct='%1.1f%%', shadow=True, startangle=90)
plt.title('Internal Market')
plt.show()
fig = plt.figure(figsize=(25, 7))
PercentSales = np.round((cs_df.groupby(["CustomerID"]).amount.sum().\
sort_values(ascending = False)[:51].sum()/cs_df.groupby(["CustomerID"]).\
amount.sum().sort_values(ascending = False).sum()) * 100, 2)
g = cs_df.groupby(["CustomerID"]).amount.sum().sort_values(ascending = False)[:51].\
plot(kind='bar', title='Top Customers: {:3.2f}% Sales Amount'.format(PercentSales))
fig = plt.figure(figsize=(25, 7))
f1 = fig.add_subplot(121)
PercentSales = np.round((cs_df.groupby(["CustomerID"]).amount.sum().\
sort_values(ascending = False)[:10].sum()/cs_df.groupby(["CustomerID"]).\
amount.sum().sort_values(ascending = False).sum()) * 100, 2)
g = cs_df.groupby(["CustomerID"]).amount.sum().sort_values(ascending = False)[:10]\
.plot(kind='bar', title='Top 10 Customers: {:3.2f}% Sales Amont'.format(PercentSales))
f1 = fig.add_subplot(122)
PercentSales = np.round((cs_df.groupby(["CustomerID"]).amount.count().\
sort_values(ascending = False)[:10].sum()/cs_df.groupby(["CustomerID"]).\
amount.count().sort_values(ascending = False).sum()) * 100, 2)
g = cs_df.groupby(["CustomerID"]).amount.count().sort_values(ascending = False)[:10].\
plot(kind='bar', title='Top 10 Customers: {:3.2f}% Event Sales'.format(PercentSales))
AmoutSum = cs_df.groupby(["Description"]).amount.sum().sort_values(ascending = False)
inv = cs_df[["Description", "InvoiceNo"]].groupby(["Description"]).InvoiceNo.unique().\
agg(np.size).sort_values(ascending = False)
fig = plt.figure(figsize=(25, 7))
f1 = fig.add_subplot(121)
Top10 = list(AmoutSum[:10].index)
PercentSales = np.round((AmoutSum[Top10].sum()/AmoutSum.sum()) * 100, 2)
PercentEvents = np.round((inv[Top10].sum()/inv.sum()) * 100, 2)
g = AmoutSum[Top10].\
plot(kind='bar', title='Top 10 Products in Sales Amount: {:3.2f}% of Amount and {:3.2f}% of Events'.\
format(PercentSales, PercentEvents))
f1 = fig.add_subplot(122)
Top10Ev = list(inv[:10].index)
PercentSales = np.round((AmoutSum[Top10Ev].sum()/AmoutSum.sum()) * 100, 2)
PercentEvents = np.round((inv[Top10Ev].sum()/inv.sum()) * 100, 2)
g = inv[Top10Ev].\
plot(kind='bar', title='Events of top 10 most sold products: {:3.2f}% of Amount and {:3.2f}% of Events'.\
format(PercentSales, PercentEvents))
fig = plt.figure(figsize=(25, 7))
Top15ev = list(inv[:15].index)
PercentSales = np.round((AmoutSum[Top15ev].sum()/AmoutSum.sum()) * 100, 2)
PercentEvents = np.round((inv[Top15ev].sum()/inv.sum()) * 100, 2)
g = AmoutSum[Top15ev].sort_values(ascending = False).\
plot(kind='bar',
title='Sales Amount of top 15 most sold products: {:3.2f}% of Amount and {:3.2f}% of Events'.\
format(PercentSales, PercentEvents))
fig = plt.figure(figsize=(25, 7))
Top50 = list(AmoutSum[:50].index)
PercentSales = np.round((AmoutSum[Top50].sum()/AmoutSum.sum()) * 100, 2)
PercentEvents = np.round((inv[Top50].sum()/inv.sum()) * 100, 2)
g = AmoutSum[Top50].\
plot(kind='bar',
title='Top 50 Products in Sales Amount: {:3.2f}% of Amount and {:3.2f}% of Events'.\
format(PercentSales, PercentEvents))
fig = plt.figure(figsize=(25, 7))
Top50Ev = list(inv[:50].index)
PercentSales = np.round((AmoutSum[Top50Ev].sum()/AmoutSum.sum()) * 100, 2)
PercentEvents = np.round((inv[Top50Ev].sum()/inv.sum()) * 100, 2)
g = inv[Top50Ev].\
plot(kind='bar', title='Top 50 most sold products: {:3.2f}% of Amount and {:3.2f}% of Events'.\
format(PercentSales, PercentEvents))
Customer Segmentation:#
Customer segmentation is similarly the process of dividing an organization’s customer bases into different sections or segments based on various customer attributes. The process of customer segmentation is based on the premise of finding differences among the customers’ behavior and patterns.
The major objectives and benefits behind the motivation for customer segmentation are:
Higher Revenue: This is the most obvious requirement of any customer segmentation project.
Customer Understanding: One of the mostly widely accepted business paradigms is “know your customer” and a segmentation of the customer base allows for a perfect dissection of this paradigm.
Target Marketing: The most visible reason for customer segmentation is the ability to focus marketing efforts effectively and efficiently. If a firm knows the different segments of its customer base, it can devise better marketing campaigns which are tailor made for the segment. A good segmentation model allows for better understanding of customer requirements and hence increases the chances of the success of any marketing campaign developed by the organization.
Optimal Product Placement: A good customer segmentation strategy can also help the firm with developing or offering new products, or a bundle of products together as a combined offering.
Finding Latent Customer Segments: Finding out which segment of customers it might be missing to identifying untapped customer segments by focused on marketing campaigns or new business development.
Clustering:
The most obvious method to perform customer segmentation is using unsupervised Machine Learning methods like clustering. The method is as simple as collecting as much data about the customers as possible in the form of features or attributes and then finding out the different clusters that can be obtained from that data. Finally, we can find traits of customer segments by analyzing the characteristics of the clusters.
Exploratory Data Analysis:
Using exploratory data analysis is another way of finding out customer segments. This is usually done by analysts who have a good knowledge about the domain relevant to both products and customers. It can be done flexibly to include the top decision points in an analysis.
RFM Model for Customer Value:#
Since our dataset is limited to the sales records, and didn´t include anothers information about our customers, we will use a RFM,*Recency, Frequency and Monetary Value, based model of customer value for finding our customer segments. The RFM model will take the transactions of a customer and calculate three important informational attributes about each customer:
Recency: The value of how recently a customer purchased at the establishment
Frequency: How frequent the customer’s transactions are at the establishment
Monetary value: The dollar (or pounds in our case) value of all the transactions that the customer made at the establishment
Recency#
To create the recency feature variable, we need to decide the reference date for our analysis. Usually, we make use of the last transaction date plus one day. Then, we will construct the recency variable as the number of days before the reference date when a customer last made a purchase.
import datetime
import pandas as pd
# Calculate reference date
reference_date = cs_df.InvoiceDate.max() + pd.Timedelta(days=1)
print('Reference Date:', reference_date)
# Calculate days since last purchase
cs_df['days_since_last_purchase'] = (reference_date - cs_df.InvoiceDate) / pd.Timedelta(1, 'D')
# Calculate customer recency
customer_history_df = cs_df[['CustomerID', 'days_since_last_purchase']].groupby("CustomerID").min().reset_index()
customer_history_df.rename(columns={'days_since_last_purchase': 'recency'}, inplace=True)
# Display summary statistics
customer_history_df.describe().transpose()
Reference Date: 2011-12-10 12:50:00
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| CustomerID | 4338.0 | 15300.408022 | 1721.808492 | 12346.0 | 13813.25 | 15299.5 | 16778.75 | 18287.0 |
| recency | 4338.0 | 93.049317 | 100.013298 | 1.0 | 18.072396 | 51.089931 | 142.730556 | 374.122917 |
We will plot the Recency Distribution and QQ-plot to identify substantive departures from normality, likes outliers, skewness and kurtosis.
def QQ_plot(data, measure):
fig = plt.figure(figsize=(20,7))
#Get the fitted parameters used by the function
(mu, sigma) = norm.fit(data)
#Kernel Density plot
fig1 = fig.add_subplot(121)
sns.distplot(data, fit=norm)
fig1.set_title(measure + ' Distribution ( mu = {:.2f} and sigma = {:.2f} )'.format(mu, sigma), loc='center')
fig1.set_xlabel(measure)
fig1.set_ylabel('Frequency')
#QQ plot
fig2 = fig.add_subplot(122)
res = probplot(data, plot=fig2)
fig2.set_title(measure + ' Probability Plot (skewness: {:.6f} and kurtosis: {:.6f} )'.format(data.skew(), data.kurt()), loc='center')
plt.tight_layout()
plt.show()
QQ_plot(customer_history_df.recency, 'Recency')
From the first graph above we can see that sales recency distribution is skewed, has a peak on the left and a long tail to the right. It deviates from normal distribution and is positively biased.
From the Probability Plot, we could see that sales recency also does not align with the diagonal red line which represent normal distribution. The form of its distribution confirm that is a skewed right.
With skewness positive of 1.25, we confirm the lack of symmetry and indicate that sales recency are skewed right, as we can see too at the Sales Distribution plot, skewed right means that the right tail is long relative to the left tail. The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers, and positive kurtosis indicates a heavy-tailed distribution and negative kurtosis indicates a light tailed distribution. So, with 0.43 of positive kurtosis sales recency are heavy-tailed and has some outliers.
Frequency#
customer_freq = (cs_df[['CustomerID', 'InvoiceNo']].groupby(["CustomerID", 'InvoiceNo']).count().reset_index()).\
groupby(["CustomerID"]).count().reset_index()
customer_freq.rename(columns={'InvoiceNo':'frequency'},inplace=True)
customer_history_df = customer_history_df.merge(customer_freq)
QQ_plot(customer_history_df.frequency, 'Frequency')
From the first graph above we can see that sales frequency distribution is skewed, has a peak on the left and a long tail to the right. It deviates from normal distribution and is positively biased.
From the Probability Plot, we could see that sales frequency also does **not align with the diagonal and confirm that is a skewed right.
With skewness positive of 12.1, we confirm the high lack of symmetry and with 249 Kurtosis indicates that is a heavy-tailed distribution and has outliers.
Monetary Value#
customer_monetary_val = cs_df[['CustomerID', 'amount']].groupby("CustomerID").sum().reset_index()
customer_history_df = customer_history_df.merge(customer_monetary_val)
QQ_plot(customer_history_df.amount, 'Amount')
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[2], line 1
----> 1 customer_monetary_val = cs_df[['CustomerID', 'amount']].groupby("CustomerID").sum().reset_index()
2 customer_history_df = customer_history_df.merge(customer_monetary_val)
3 QQ_plot(customer_history_df.amount, 'Amount')
NameError: name 'cs_df' is not defined
From the first graph above we can see that sales amount distribution is skewed, has a peak on the left and a long tail to the right. It deviates from normal distribution and is positively biased.
From the Probability Plot, we could see that sales amount also does not align with the diagonal, special on the right.
With skewness positive of 19.3, we confirm the high lack of symmetry and with 478 Kurtosis indicates that is a too heavy-tailed distribution and has outliers, surely more than 10 very extreme.
Let’s see a statistical summary of this dataset:
customer_history_df.describe()
| CustomerID | recency | frequency | amount | |
|---|---|---|---|---|
| count | 4338.0 | 4338.000000 | 4338.000000 | 4338.000000 |
| mean | 15300.408022 | 93.049317 | 4.272015 | 2054.266460 |
| std | 1721.808492 | 100.013298 | 7.697998 | 8989.230441 |
| min | 12346.0 | 1.000000 | 1.000000 | 3.750000 |
| 25% | 13813.25 | 18.072396 | 1.000000 | 307.415000 |
| 50% | 15299.5 | 51.089931 | 2.000000 | 674.485000 |
| 75% | 16778.75 | 142.730556 | 5.000000 | 1661.740000 |
| max | 18287.0 | 374.122917 | 209.000000 | 280206.020000 |
Data Preprocessing#
Once we have created our customer value dataset, we will perform some preprocessing on the data. For our clustering, we will be using the K-means clustering algorithm. One of the requirements for proper functioning of the algorithm is the mean centering of the variable values. Mean centering of a variable value means that we will replace the actual value of the variable with a standardized value, so that the variable has a mean of 0 and variance of 1. This ensures that all the variables are in the same range and the difference in ranges of values doesn’t cause the algorithm to not perform well. This is akin to feature scaling.
Another problem that you can investigate about is the huge range of values each variable can take. This problem is particularly noticeable for the monetary amount variable. To take care of this problem, we will transform all the variables on the log scale. This transformation, along with the standardization, will ensure that the input to our algorithm is a homogenous set of scaled and transformed values.
An important point about the data preprocessing step is that sometimes we need it to be reversible. In our case, we will have the clustering results in terms of the log transformed and scaled variable. But to make inferences in terms of the original data, we will need to reverse transform all the variable so that we get back the actual RFM figures. This can be done by using the preprocessing capabilities of Python.
customer_history_df['recency_log'] = customer_history_df['recency'].apply(math.log)
customer_history_df['frequency_log'] = customer_history_df['frequency'].apply(math.log)
customer_history_df['amount_log'] = customer_history_df['amount'].apply(math.log)
feature_vector = ['amount_log', 'recency_log','frequency_log']
X_subset = customer_history_df[feature_vector] #.as_matrix()
scaler = preprocessing.StandardScaler().fit(X_subset)
X_scaled = scaler.transform(X_subset)
pd.DataFrame(X_scaled, columns=X_subset.columns).describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| amount_log | 4338.0 | -7.010426e-16 | 1.000115 | -4.179280 | -0.684183 | -0.060942 | 0.654244 | 4.721395 |
| recency_log | 4338.0 | 9.827700e-18 | 1.000115 | -2.747518 | -0.655196 | 0.096033 | 0.838707 | 1.535305 |
| frequency_log | 4338.0 | -9.991495e-17 | 1.000115 | -1.048610 | -1.048610 | -0.279044 | 0.738267 | 4.882714 |
fig = plt.figure(figsize=(20,14))
f1 = fig.add_subplot(221); sns.regplot(x='recency', y='amount', data=customer_history_df)
f1 = fig.add_subplot(222); sns.regplot(x='frequency', y='amount', data=customer_history_df)
f1 = fig.add_subplot(223); sns.regplot(x='recency_log', y='amount_log', data=customer_history_df)
f1 = fig.add_subplot(224); sns.regplot(x='frequency_log', y='amount_log', data=customer_history_df)
fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(111, projection='3d')
xs =customer_history_df.recency_log
ys = customer_history_df.frequency_log
zs = customer_history_df.amount_log
ax.scatter(xs, ys, zs, s=5)
ax.set_xlabel('Recency')
ax.set_ylabel('Frequency')
ax.set_zlabel('Monetary')
plt.show()
The obvious patterns we can see from the plots above is that costumers who buy with a higher frequency and more recency tend to spend more based on the increasing trend in Monetary (amount value) with a corresponding increasing and decreasing trend for Frequency and Recency, respectively.
Clustering for Segments#
K-Means Clustering#
The K-means clustering belongs to the partition based\centroid based hard clustering family of algorithms, a family of algorithms where each sample in a dataset is assigned to exactly one cluster.
Based on this Euclidean distance metric, we can describe the k-means algorithm as a simple optimization problem, an iterative approach for minimizing the within-cluster sum of squared errors (SSE), which is sometimes also called cluster inertia. So, the objective of K-Means clustering is to minimize total intra-cluster variance, or, the squared error function:

The steps that happen in the K-means algorithm for partitioning the data are as given follows:
The algorithm starts with random point initializations of the required number of centers. The “K” in K-means stands for the number of clusters.
In the next step, each of the data point is assigned to the center closest to it. The distance metric used in K-means clustering is normal Euclidian distance.
Once the data points are assigned, the centers are recalculated by averaging the dimensions of the points belonging to the cluster.
The process is repeated with new centers until we reach a point where the assignments become stable. In this case, the algorithm terminates.
K-means++#
Place the initial centroids far away from each other via the k-means++ algorithm, which leads to better and more consistent results than the classic k-means.
To use k-means++ with scikit-learn’s KMeans object, we just need to set the init parameter to k-means++ (the default setting) instead of random.
The Elbow Method#
Using the elbow method to find the optimal number of clusters. The idea behind the elbow method is to identify the value of k where the distortion begins to increase most rapidly. If k increases, the distortion will decrease, because the samples will be closer to the centroids they are assigned to.
This method looks at the percentage of variance explained as a function of the number of clusters. More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the “elbow criterion”. This “elbow” cannot always be unambiguously identified.Percentage of variance explained is the ratio of the between-group variance to the total variance, also known as an F-test. A slight variation of this method plots the curvature of the within group variance.
cl = 50
corte = 0.1
anterior = 100000000000000
cost = []
K_best = cl
for k in range (1, cl+1):
# Create a kmeans model on our data, using k clusters. random_state helps ensure that the algorithm returns the same results each time.
model = KMeans(
n_clusters=k,
init='k-means++', #'random',
n_init=10,
max_iter=300,
tol=1e-04,
random_state=101)
model = model.fit(X_scaled)
# These are our fitted labels for clusters -- the first cluster has label 0, and the second has label 1.
labels = model.labels_
# Sum of distances of samples to their closest cluster center
interia = model.inertia_
if (K_best == cl) and (((anterior - interia)/anterior) < corte): K_best = k - 1
cost.append(interia)
anterior = interia
plt.figure(figsize=(8, 6))
plt.scatter(range (1, cl+1), cost, c='red')
plt.show()
# Create a kmeans model with the best K.
print('The best K sugest: ',K_best)
model = KMeans(n_clusters=K_best, init='k-means++', n_init=10,max_iter=300, tol=1e-04, random_state=101)
# Note I'm scaling the data to normalize it! Important for good results.
model = model.fit(X_scaled)
# These are our fitted labels for clusters -- the first cluster has label 0, and the second has label 1.
labels = model.labels_
# And we'll visualize it:
#plt.scatter(X_scaled[:,0], X_scaled[:,1], c=model.labels_.astype(float))
fig = plt.figure(figsize=(20,5))
ax = fig.add_subplot(121)
plt.scatter(x = X_scaled[:,1], y = X_scaled[:,0], c=model.labels_.astype(float))
ax.set_xlabel(feature_vector[1])
ax.set_ylabel(feature_vector[0])
ax = fig.add_subplot(122)
plt.scatter(x = X_scaled[:,2], y = X_scaled[:,0], c=model.labels_.astype(float))
ax.set_xlabel(feature_vector[2])
ax.set_ylabel(feature_vector[0])
plt.show()
The best K sugest: 7
Note that by the Elbow method from a K equal to 3 we already observed low rates of gain in the decay of the distortions with the decrease of K reaching the limit of 10% with the K equal to 7. With this in mind, we will begin to evaluate the options more deeply with 3, and 7, starting with the silhouette analysis.
Silhouette analysis on K-Means clustering#
Silhouette analysis can be used to study the separation distance between the resulting clusters, as a strategy to quantifying the quality of clustering via graphical tool to plot a measure of how tightly grouped the samples in the clusters are. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters and thus provides a way to assess parameters like number of clusters visually.
It can also be applied to clustering algorithms other than k-means
Silhouette coefficients has a range of [-1, 1], it calculated by:
Calculate the cluster cohesion a( i )as the average distance between a sample x( i ) and all other points in the same cluster.
Calculate the cluster separation b( i ) from the next closest cluster as the average distance between the sample x( i ) and all samples in the nearest cluster.
Calculate the silhouette s( i ) as the difference between cluster cohesion and separation divided by the greater of the two, as shown here:
Which can be also written as:
Where:
If near +1, it indicate that the sample is far away from the neighboring clusters.
a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
If most objects have a high value, then the clustering configuration is appropriate.
If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.
A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters
Negative values indicate that those samples might have been assigned to the wrong cluster.
The silhouette plot can shows a bad K clusters pick for the given data due to the presence of clusters with below average silhouette scores and also due to wide fluctuations in the size of the silhouette plots. A good k clusters can found when all the plots are more or less of similar thickness and hence are of similar sizes.
Although we have to keep in mind that in several cases and scenarios, sometimes we may have to drop the mathematical explanation given by the algorithm and look at the business relevance of the results obtained.
Let’s see below how our data perform for each K clusters groups (3, 5 and 7) in the silhouette score of each cluster, along with the center of each of the cluster discovered in the scatter plots, by amount_log vs recency_log and vs frequency_log.
cluster_centers = dict()
for n_clusters in range(3,K_best+1,2):
fig, (ax1, ax2, ax3) = plt.subplots(1, 3)
fig.set_size_inches(25, 7)
ax1.set_xlim([-0.1, 1])
ax1.set_ylim([0, len(X_scaled) + (n_clusters + 1) * 10])
clusterer = KMeans(n_clusters=n_clusters, init='k-means++', n_init=10,max_iter=300, tol=1e-04, random_state=101)
cluster_labels = clusterer.fit_predict(X_scaled)
silhouette_avg = silhouette_score(X = X_scaled, labels = cluster_labels)
cluster_centers.update({n_clusters :{'cluster_center':clusterer.cluster_centers_,
'silhouette_score':silhouette_avg,
'labels':cluster_labels}
})
sample_silhouette_values = silhouette_samples(X = X_scaled, labels = cluster_labels)
y_lower = 10
for i in range(n_clusters):
ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
ith_cluster_silhouette_values.sort()
size_cluster_i = ith_cluster_silhouette_values.shape[0]
y_upper = y_lower + size_cluster_i
color = cm.Spectral(float(i) / n_clusters)
ax1.fill_betweenx(np.arange(y_lower, y_upper),
0, ith_cluster_silhouette_values,
facecolor=color, edgecolor=color, alpha=0.7)
ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
y_lower = y_upper + 10 # 10 for the 0 samples
ax1.set_title("The silhouette plot for the various clusters")
ax1.set_xlabel("The silhouette coefficient values")
ax1.set_ylabel("Cluster label")
ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
ax1.set_yticks([])
ax1.set_xticks([-0.1, 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
colors = cm.Spectral(cluster_labels.astype(float) / n_clusters)
centers = clusterer.cluster_centers_
y = 0
x = 1
ax2.scatter(X_scaled[:, x], X_scaled[:, y], marker='.', s=30, lw=0, alpha=0.7, c=colors, edgecolor='k')
ax2.scatter(centers[:, x], centers[:, y], marker='o', c="white", alpha=1, s=200, edgecolor='k')
for i, c in enumerate(centers):
ax2.scatter(c[x], c[y], marker='$%d$' % i, alpha=1, s=50, edgecolor='k')
ax2.set_title("{} Clustered data".format(n_clusters))
ax2.set_xlabel(feature_vector[x])
ax2.set_ylabel(feature_vector[y])
x = 2
ax3.scatter(X_scaled[:, x], X_scaled[:, y], marker='.', s=30, lw=0, alpha=0.7, c=colors, edgecolor='k')
ax3.scatter(centers[:, x], centers[:, y], marker='o', c="white", alpha=1, s=200, edgecolor='k')
for i, c in enumerate(centers):
ax3.scatter(c[x], c[y], marker='$%d$' % i, alpha=1, s=50, edgecolor='k')
ax3.set_title("Silhouette score: {:1.2f}".format(cluster_centers[n_clusters]['silhouette_score']))
ax3.set_xlabel(feature_vector[x])
ax3.set_ylabel(feature_vector[y])
plt.suptitle(("Silhouette analysis for KMeans clustering on sample data with n_clusters = %d" % n_clusters),
fontsize=14, fontweight='bold')
plt.show()
When we look at the results of the clustering process, we can infer some interesting insights:
First notice that all K clusters options is valid, because they don’t have presence of clusters with below average silhouette scores.
In the other hand, all options had a some wide fluctuations in the size of the silhouette plots.
So, the best choice may lie on the option that gives us a simpler business explanation and at the same time target customers in focus groups with sizes closer to the desired.
Clusters Center:#
Let’s look at the cluster center values after returning them to normal values from the log and scaled version.
features = ['amount', 'recency', 'frequency']
for i in range(3,K_best+1,2):
print("for {} clusters the silhouette score is {:1.2f}".format(i, cluster_centers[i]['silhouette_score']))
print("Centers of each cluster:")
cent_transformed = scaler.inverse_transform(cluster_centers[i]['cluster_center'])
print(pd.DataFrame(np.exp(cent_transformed),columns=features))
print('-'*50)
for 3 clusters the silhouette score is 0.33
Centers of each cluster:
amount recency frequency
0 955.108413 35.220128 3.042955
1 3859.014223 8.763853 9.651473
2 259.270172 119.903139 1.183191
--------------------------------------------------
for 5 clusters the silhouette score is 0.31
Centers of each cluster:
amount recency frequency
0 1161.573046 83.524872 3.155363
1 5952.818406 8.576574 13.683600
2 296.081689 31.515878 1.295759
3 1314.406149 11.107092 4.290844
4 231.257842 212.118311 1.144017
--------------------------------------------------
for 7 clusters the silhouette score is 0.31
Centers of each cluster:
amount recency frequency
0 2134.498750 5.204186 6.460064
1 205.931613 226.183423 1.085344
2 240.188728 37.193024 1.130437
3 815.752704 108.048616 2.280582
4 666.191186 14.347347 2.667995
5 2408.833819 38.804916 5.996435
6 10200.920787 5.633035 20.695211
--------------------------------------------------
Clusters Insights:#
With the plots and the center in the correct units, let’s see some insights by each clusters groups:
In the three-cluster:
The tree clusters appears have a good stark differences in the Monetary value of the customer, we will confirm this by a box plot.
Cluster 1 is the cluster of high value customer who shops frequently and is certainly an important segment for each business.
In the similar way we obtain customer groups with low and medium spends in clusters with labels 0 and 2, respectively.
Frequency and Recency correlate perfectly to the Monetary value based on the trend (High Monetary-Low Recency-High Frequency).
In the five-cluster:
Note that clusters 0 and 1 are very similar to their cluster in the configuration with only 3 clusters.
The cluster 1 appears more robust on the affirmation of those who shop often and with high amount.
The cluster 2 are those who have a decent spend but are not as frequent as the cluster 1
The cluster 4 purchases medium amounts, with a relatively low frequency and not very recent
The cluster 3 makes low-cost purchases, with a relatively low frequency, but above 1, and made their last purchase more recently. This group of customers probably response to price discounts and can be subject to loyalty promotions to try increase the medium-ticket, strategy that can be better defined when we analyzing the market basket.
The silhouette score matrix says that the five cluster segments are less optimal then the three cluster segments.
In the five-cluster:
Definitely cluster 6 defines those who shop often and with high amount.
Clusters 1 and 5 show good spending and good frequency, only deferring in how recent were their last purchases, where 5 is older, which suggests an active action to sell to group 5 as soon as possible and another to 1 seeking to raise its frequency.
Cluster 0 presents the fourth best purchase and a reasonable frequency, but this is a long time without buying. This group should be sensible to promotions and activations, so that they do not get lost and make their next purchase.
Cluster 5 is similar to 0, but has made its purchases more recently and has a slightly better periodicity. Then actions must be taken to raise their frequency and reduce the chances of them migrating to cluster 0 by staying longer without purchasing products.
Drill Down Clusters:#
To further drill down on this point and find out the quality of these difference, we can label our data with the corresponding cluster label and then visualize these differences. The following code will extract the clustering label and attach it with our customer summary dataset.
customer_history_df['clusters_3'] = cluster_centers[3]['labels']
customer_history_df['clusters_5'] = cluster_centers[5]['labels']
customer_history_df['clusters_7'] = cluster_centers[7]['labels']
display(customer_history_df.head())
fig = plt.figure(figsize=(20,7))
f1 = fig.add_subplot(131)
market = customer_history_df.clusters_3.value_counts()
g = plt.pie(market, labels=market.index, autopct='%1.1f%%', shadow=True, startangle=90)
plt.title('3 Clusters')
f1 = fig.add_subplot(132)
market = customer_history_df.clusters_5.value_counts()
g = plt.pie(market, labels=market.index, autopct='%1.1f%%', shadow=True, startangle=90)
plt.title('5 Clusters')
f1 = fig.add_subplot(133)
market = customer_history_df.clusters_7.value_counts()
g = plt.pie(market, labels=market.index, autopct='%1.1f%%', shadow=True, startangle=90)
plt.title('7 Clusters')
plt.show()
| CustomerID | recency | frequency | amount | recency_log | frequency_log | amount_log | clusters_3 | clusters_5 | clusters_7 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12346 | 326.117361 | 1 | 77183.60 | 5.787257 | 0.000000 | 11.253942 | 0 | 0 | 5 |
| 1 | 12347 | 2.873611 | 7 | 4310.00 | 1.055569 | 1.945910 | 8.368693 | 1 | 1 | 0 |
| 2 | 12348 | 75.984028 | 4 | 1797.24 | 4.330523 | 1.386294 | 7.494007 | 0 | 0 | 5 |
| 3 | 12349 | 19.124306 | 1 | 1757.55 | 2.950960 | 0.000000 | 7.471676 | 0 | 2 | 4 |
| 4 | 12350 | 310.867361 | 1 | 334.40 | 5.739366 | 0.000000 | 5.812338 | 2 | 4 | 1 |
Once we have the labels assigned to each of the customers, our task is simple. Now we want to find out how the summary of customer in each group is varying. If we can visualize that information we will able to find out the differences in the clusters of customers and we can modify our strategy on the basis of those differences.
The following code leverages plotly and will take the cluster labels we got for each configurations clusters and create boxplots. Plotly enables us to interact with the plots to see the central tendency values in each boxplot in the notebook. Note that we want to avoid the extremely high outlier values of each group, as they will interfere in making a good observation around the central tendencies of each cluster. Since we have only positive values, we will restrict the data such that only data points which are less than 0.95th percentile of the cluster is used. This will give us good information about the majority of the users in that cluster segment.
I’ve used these charts to review my previously stated insights, but follow the same for you to explore:
x_data = ['Cluster 0', 'Cluster 1','Cluster 2','Cluster 3','Cluster 4', 'Cluster 5', 'Cluster 6']
colors = ['rgba(93, 164, 214, 0.5)', 'rgba(255, 144, 14, 0.5)', 'rgba(44, 160, 101, 0.5)', 'rgba(255, 65, 54, 0.5)',
'rgba(22, 80, 57, 0.5)', 'rgba(127, 65, 14, 0.5)', 'rgba(207, 114, 255, 0.5)', 'rgba(127, 96, 0, 0.5)']
cutoff_quantile = 95
for n_clusters in range(3,K_best+1,2):
cl = 'clusters_' + str(n_clusters)
for fild in range(0, 3):
field_to_plot = features[fild]
y_data = list()
ymax = 0
for i in np.arange(0,n_clusters):
y0 = customer_history_df[customer_history_df[cl]==i][field_to_plot].values
y0 = y0[y0<np.percentile(y0, cutoff_quantile)]
if ymax < max(y0): ymax = max(y0)
y_data.insert(i, y0)
traces = []
for xd, yd, cls in zip(x_data[:n_clusters], y_data, colors[:n_clusters]):
traces.append(go.Box(y=yd, name=xd, boxpoints=False, jitter=0.5, whiskerwidth=0.2, fillcolor=cls,
marker=dict( size=1, ),
line=dict(width=1),
))
layout = go.Layout(
title='Difference in {} with {} Clusters and {:1.2f} Score'.\
format(field_to_plot, n_clusters, cluster_centers[n_clusters]['silhouette_score']),
yaxis=dict( autorange=True, showgrid=True, zeroline=True,
dtick = int(ymax/10),
gridcolor='black', gridwidth=0.1, zerolinecolor='rgb(255, 255, 255)', zerolinewidth=2, ),
margin=dict(l=40, r=30, b=50, t=50, ),
paper_bgcolor='white',
plot_bgcolor='white',
showlegend=False
)
fig = go.Figure(data=traces, layout=layout)
py.offline.iplot(fig)
Next Steps in the Segmentation:#
To enhance discovery and can further improve the quality of clustering by adding relevant features, other customer information and purchases details may be included in this dataset.
For example, but not limited to:
New indicators, such as customer relationship time, based on the date of your first purchase of the client
whether the customer is from abroad or not
some group or category of product to be obtained through the SKUs
External data vendors and use it, and so on.
Another dimension to explore can be trying out different algorithms for performing the segmentation for instance hierarchical clustering, which we explored in some of the earlier chapters. A good segmentation process will encompass all these avenues to arrive at optimal segments that provide valuable insight.
Cross Selling#
The cross selling is the ability to sell more products to a customer by analyzing the customer’s shopping trends as well as general shopping trends and patterns which are in common with the customer’s shopping patterns. More often than not, these recommended products would be very appealing. The retailer will often offer you a bundle of products with some attractive offer and it is highly likely that we will end up buying the bundled products instead of just the original item.
So, we research the customer transactions and find out potential additions to the customer’s original needs and offer it to the customer as a suggestion in the hope and intent that they buy them benefiting both the customer as well as the retail establishment.
In this section, we explore association rule-mining, a powerful technique that can be used for cross selling, then we apply the concept of market basket analysis to our retail transactions dataset.
Market Basket Analysis with Association Rule-Mining#
The whole concept of association rule-mining is based on the concept that customer purchase behavior has a pattern which can be exploited for selling more items to the customer in the future.
Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness. This rule-based approach also generates new rules as it analyzes more data. The ultimate goal, assuming a large enough dataset, is to help a machine mimic the human brain’s feature extraction and abstract association capabilities from new uncategorized data.
An association rule usually has the structure like below:
This rule can be read in the obvious manner that when the customer bought items on the left of the rule he is likely to buy the item on the right. Following are some vital concepts pertaining to association rule-mining.
Itemset: Is just a collection of one or more items that occur together in a transaction. For example, here {milk, bread} is example of an itemset.
Support: is defined as number of times an itemset appears in the dataset. The support of X with respect to T is defined as the proportion of transactions t in the dataset which contains the itemset X. Mathematically it is defined as:
Confidence: Confidence is an indication of how often the rule has been found to be true. It is a measure of the times the number of times a rule is found to exist in the dataset. For a rule which states { beer -> diaper } the confidence is mathematically defined as:
Lift: Lift of the rule is defined as the ratio of observed support to the support expected in the case the elements of the rule were independent. For the previous set of transactions if the rule is defined as { X -> Y }, then the lift of the rule is defined as:
If the rule had a lift of 1, it would imply that the probability of occurrence of the antecedent and that of the consequent are independent of each other. When two events are independent of each other, no rule can be drawn involving those two events.
If the lift is > 1, that lets us know the degree to which those two occurrences are dependent on one another, and makes those rules potentially useful for predicting the consequent in future data sets.
If the lift is < 1, that lets us know the items are substitute to each other. This means that presence of one item has negative effect on presence of other item and vice versa.
Frequent itemset: Frequent itemsets are itemsets whose support is greater than a user defined support threshold.
Conviction: Is the ratio of the expected frequency that item X occurs without a item Y (that is to say, the frequency that the rule makes an incorrect prediction) if X and Y were independent divided by the observed frequency of incorrect predictions. The conviction of a rule is defined as:
Algorithms:#
Some well-known algorithms are Apriori, Eclat and FP-Growth, but they only do half the job, since they are algorithms for mining frequent itemsets. Another step needs to be done after to generate rules from frequent itemsets found in a database.
The major bottleneck in any association rule-mining algorithm is the generation of frequent itemsets. If the transaction dataset is having k unique products, then potentially we have 2k possible itemsets.
Apriori#
Apriori uses a breadth-first search strategy to count the support of itemsets and uses a candidate generation function which exploits the downward closure property of support. So, the algorithm will first generate these itemsets and then proceed to finding the frequent itemsets. For around 100 unique products the possible number of itemsets is huge, and shows up that the Apriori algorithm prohibitively computationally expensive.
Eclat algorithm#
Eclat is a depth-first search algorithm based on set intersection. It is suitable for both sequential as well as parallel execution with locality-enhancing properties.
FP Growth#
FP stands for frequent pattern. The FP growth algorithm is superior to Apriori algorithm as it doesn’t need to generate all the candidate itemsets. The FP growth algorithm uses a divide-and-conquer strategy and leverages a special data structure called the FP-tree, to find frequent itemsets without generating all itemsets. The core steps of the algorithm are as follows:
In the first pass, the algorithm take in the transactional database and counts occurrence of items (attribute-value pairs) in the dataset, and stores them to ‘header table’.
In the second pass, it builds the FP-tree structure by inserting instances to represent frequent itemsets. Items in each instance have to be sorted by descending order of their frequency in the dataset, so that the tree can be processed quickly. Items in each instance that do not meet minimum coverage threshold are discarded. If many instances share most frequent items, FP-tree provides high compression close to tree root.
Divide this compressed representation into multiple conditional datasets such that each one is associated with a frequent pattern.
Mine for patterns in each such dataset so that shorter patterns can be recursively concatenated to longer patterns, hence making it more efficient.
Recursive processing of this compressed version of main dataset grows large item sets directly, instead of generating candidate items and testing them against the entire database. Growth starts from the bottom of the header table (having longest branches), by finding all instances matching given condition. New tree is created, with counts projected from the original tree corresponding to the set of instances that are conditional on the attribute, with each node getting sum of its children counts. Recursive growth ends when no individual items conditional on the attribute meet minimum support threshold, and processing continues on the remaining header items of the original FP-tree.
Once the recursive process has completed, all large item sets with minimum coverage have been found, and association rule creation begins.
Build Transaction Dataset#
In order to perform our data in these algorithms, we must first turn them into a sales event table where each product sold will be represented by a column, having its value 1 for when it was sold in that event or zero when not. This will generate a sparse table
items = list(cs_df.Description.unique())
grouped = cs_df.groupby('InvoiceNo')
transaction_level = grouped.aggregate(lambda x: tuple(x)).reset_index()[['InvoiceNo','Description']]
transaction_dict = {item:0 for item in items}
output_dict = dict()
temp = dict()
for rec in transaction_level.to_dict('records'):
invoice_num = rec['InvoiceNo']
items_list = rec['Description']
transaction_dict = {item:0 for item in items}
transaction_dict.update({item:1 for item in items if item in items_list})
temp.update({invoice_num:transaction_dict})
new = [v for k,v in temp.items()]
transaction_df = pd.DataFrame(new)
Prune Dataset for frequently purchased items#
We saw in the earlier on EDA how only a handful of items are responsible for bulk of our sales so we want to prune our dataset to reflect this information. For this we have created a function prune_dataset below, which will help us reduce the size of our dataset based on our requirements. The function can be used for performing three types of pruning:
Pruning based on percentage of total sales: The parameter total_sales_perc will help us select the number of items that will explain the required percentage of sales. The default value is 50% or 0.5.
Pruning based on ranks of items: Another way to perform the pruning is to specify the starting and the ending rank of the items for which we want to prune our dataset.
Pruning based on list of features passed to the parameter TopCols.
By default, we will only look for transactions which have at least two items, as transactions with only one item are counter to the whole concept of association rule-mining.
def prune_dataset(input_df, length_trans = 2, total_sales_perc = 0.5,
start_item = None, end_item = None, TopCols = None):
if 'total_items' in input_df.columns:
del(input_df['total_items'])
item_count = input_df.sum().sort_values(ascending = False).reset_index()
total_items = sum(input_df.sum().sort_values(ascending = False))
item_count.rename(columns={item_count.columns[0]:'item_name',
item_count.columns[1]:'item_count'}, inplace=True)
if TopCols:
input_df['total_items'] = input_df[TopCols].sum(axis = 1)
input_df = input_df[input_df.total_items >= length_trans]
del(input_df['total_items'])
return input_df[TopCols], item_count[item_count.item_name.isin(TopCols)]
elif end_item > start_item:
selected_items = list(item_count[start_item:end_item].item_name)
input_df['total_items'] = input_df[selected_items].sum(axis = 1)
input_df = input_df[input_df.total_items >= length_trans]
del(input_df['total_items'])
return input_df[selected_items],item_count[start_item:end_item]
else:
item_count['item_perc'] = item_count['item_count']/total_items
item_count['total_perc'] = item_count.item_perc.cumsum()
selected_items = list(item_count[item_count.total_perc < total_sales_perc].item_name)
input_df['total_items'] = input_df[selected_items].sum(axis = 1)
input_df = input_df[input_df.total_items >= length_trans]
del(input_df['total_items'])
return input_df[selected_items], item_count[item_count.total_perc < total_sales_perc]
We use the second option of pruning, by the Top 15th products in sales events.
output_df, item_counts = prune_dataset(input_df=transaction_df, length_trans=2,start_item = 0, end_item = 15)
print('Total of Sales Amount by the Top 15 Products in Sales Events (Invoice): {:.2f}'.format(AmoutSum[Top15ev].sum()))
print('Number of Sales Events:', output_df.shape[0])
print('Number of Products:', output_df.shape[1])
item_counts
Total of Sales Amount by the Top 15 Products in Sales Events (Invoice): 778377.21
Number of Sales Events: 4664
Number of Products: 15
| item_name | item_count | |
|---|---|---|
| 0 | WHITE HANGING HEART T-LIGHT HOLDER | 1978 |
| 1 | REGENCY CAKESTAND 3 TIER | 1703 |
| 2 | JUMBO BAG RED RETROSPOT | 1600 |
| 3 | PARTY BUNTING | 1379 |
| 4 | ASSORTED COLOUR BIRD ORNAMENT | 1375 |
| 5 | LUNCH BAG RED RETROSPOT | 1289 |
| 6 | SET OF 3 CAKE TINS PANTRY DESIGN | 1146 |
| 7 | POSTAGE | 1099 |
| 8 | JUMBO BAG VINTAGE DOILY | 1080 |
| 9 | LUNCH BAG BLACK SKULL. | 1052 |
| 10 | LUNCH BAG SUKI DESIGN | 1043 |
| 11 | POPCORN HOLDER | 1035 |
| 12 | PACK OF 72 RETROSPOT CAKE CASES | 1029 |
| 13 | SPOTTY BUNTING | 1009 |
| 14 | LUNCH BAG VINTAGE DOILY | 1006 |
So we find out that we have 15 items responsible for 8,73% of sales amount and close to 5% of the events result in 4.664 transactions that have those items along with other items. The next step is to convert this selected data into the required table data structure.
Association Rule Mining with FP Growth#
Orange Table Data Structure#
Since we are using the Orage framework we still have to convert it to the Table data structure by providing the metadata about our columns. We need to define the domain for each of our variables. The domain means the possible set of values that each of our variables can use. This information will be stored as metadata and will be used in later transformation of the data. As our columns are only having binary values,we can easily create the domain by using this information.
# Definisi domain data dengan menyatakan setiap variabel sebagai DiscreteVariable dengan nilai (0, 1)
domain_transac = Domain([DiscreteVariable.make(name=item, values=['0', '1']) \
for item in input_assoc_rules.columns])
# Membuat struktur data menggunakan domain yang telah didefinisikan
data_tran = Orange.data.Table.from_numpy(domain=domain_transac,
X=input_assoc_rules.to_numpy(), Y=None)
# Kode untuk mengkodekan seluruh domain sebagai variabel biner
data_tran_en, mapping = OneHot.encode(data_tran, include_class=True)
Creating our rules#
We need to specify two pieces of information for generating our rules: support and confidence. An important piece of information is to start with a higher support, as lower support will mean a higher number of frequent itemsets and hence a longer execution time. We will specify a min support of 0.01 and see the number of frequent itemsets that we get before we specify confidence and generate our rules.
support = 0.01
print("num of required transactions = ", int(input_assoc_rules.shape[0]*support))
num_trans = input_assoc_rules.shape[0]*support
itemsets = dict(frequent_itemsets(data_tran_en, support))
print('Items Set Size:', len(itemsets))
num of required transactions = 46
Items Set Size: 663273
So we get a whopping 663.273 itemsets for only 15 itens and a support of only 1%! This will increase exponentially if we decrease the support or if we increase the number of items in our dataset. The next step is specifying a confidence value and generating our rules. The following code snippet will perform rule generation and decoding of rules, and then compile it all in a neat dataframe that we can use for further analysis.
confidence = 0.6
rules_df = pd.DataFrame()
if len(itemsets) < 1000000:
rules = [(P, Q, supp, conf)
for P, Q, supp, conf in association_rules(itemsets, confidence)
if len(Q) == 1 ]
names = {item: '{}={}'.format(var.name, val)
for item, var, val in OneHot.decode(mapping, data_tran, mapping)}
eligible_ante = [v for k,v in names.items() if v.endswith("1")]
N = input_assoc_rules.shape[0]
rule_stats = list(rules_stats(rules, itemsets, N))
rule_list_df = []
for ex_rule_frm_rule_stat in rule_stats:
ante = ex_rule_frm_rule_stat[0]
cons = ex_rule_frm_rule_stat[1]
named_cons = names[next(iter(cons))]
if named_cons in eligible_ante:
rule_lhs = [names[i][:-2] for i in ante if names[i] in eligible_ante]
ante_rule = ', '.join(rule_lhs)
if ante_rule and len(rule_lhs)>1 :
rule_dict = {'support' : ex_rule_frm_rule_stat[2],
'confidence' : ex_rule_frm_rule_stat[3],
'coverage' : ex_rule_frm_rule_stat[4],
'strength' : ex_rule_frm_rule_stat[5],
'lift' : ex_rule_frm_rule_stat[6],
'leverage' : ex_rule_frm_rule_stat[7],
'antecedent': ante_rule,
'consequent':named_cons[:-2] }
rule_list_df.append(rule_dict)
rules_df = pd.DataFrame(rule_list_df)
print("Raw rules data frame of {} rules generated".format(rules_df.shape[0]))
if not rules_df.empty:
pruned_rules_df = rules_df.groupby(['antecedent','consequent']).max().reset_index()
else:
print("Unable to generate any rule")
Raw rules data frame of 25247 rules generated
Explore The Association Rule Created#
Let’s see what we get in the first 5 rules with highest confidence:
dw = pd.options.display.max_colwidth
pd.options.display.max_colwidth = 100
(rules_df[['consequent', 'antecedent', 'support','confidence','lift']].\
groupby(['consequent', 'antecedent']).first()
.reset_index()
.sort_values(['confidence', 'support', 'lift'],
ascending=False)).head()
| consequent | antecedent | support | confidence | lift | |
|---|---|---|---|---|---|
| 20 | JUMBO BAG VINTAGE DOILY | JUMBO BAG RED RETROSPOT, LUNCH BAG RED RETROSPOT, LUNCH BAG VINTAGE DOILY | 60 | 0.909091 | 4.520256 |
| 133 | LUNCH BAG VINTAGE DOILY | LUNCH BAG RED RETROSPOT, JUMBO BAG VINTAGE DOILY , LUNCH BAG BLACK SKULL., LUNCH BAG SUKI DESIGN | 49 | 0.890909 | 4.596460 |
| 134 | LUNCH BAG VINTAGE DOILY | LUNCH BAG RED RETROSPOT, JUMBO BAG VINTAGE DOILY , LUNCH BAG SUKI DESIGN | 48 | 0.888889 | 4.586037 |
| 131 | LUNCH BAG VINTAGE DOILY | LUNCH BAG RED RETROSPOT, JUMBO BAG VINTAGE DOILY , LUNCH BAG BLACK SKULL. | 48 | 0.872727 | 4.502655 |
| 21 | JUMBO BAG VINTAGE DOILY | JUMBO BAG RED RETROSPOT, LUNCH BAG SUKI DESIGN , LUNCH BAG VINTAGE DOILY | 48 | 0.872727 | 4.339446 |
Now, the first 5 higest support:
(rules_df[['consequent', 'antecedent', 'support','confidence','lift']].\
groupby(['consequent', 'antecedent']).first()
.reset_index()
.sort_values(['support', 'confidence', 'lift'],
ascending=False)).head()
| consequent | antecedent | support | confidence | lift | |
|---|---|---|---|---|---|
| 123 | LUNCH BAG VINTAGE DOILY | JUMBO BAG VINTAGE DOILY , LUNCH BAG RED RETROSPOT | 154 | 0.777778 | 4.012783 |
| 28 | JUMBO BAG VINTAGE DOILY | LUNCH BAG VINTAGE DOILY , JUMBO BAG RED RETROSPOT | 153 | 0.805263 | 4.003995 |
| 92 | LUNCH BAG RED RETROSPOT | LUNCH BAG VINTAGE DOILY , LUNCH BAG SUKI DESIGN | 149 | 0.668161 | 2.672646 |
| 77 | LUNCH BAG RED RETROSPOT | LUNCH BAG BLACK SKULL., LUNCH BAG SUKI DESIGN | 142 | 0.601695 | 2.406780 |
| 4 | JUMBO BAG RED RETROSPOT | JUMBO BAG VINTAGE DOILY , LUNCH BAG RED RETROSPOT | 138 | 0.696970 | 2.688723 |
Typically, a lift value of 1 indicates that the probability of occurrence of the antecedent and consequent together are independent of each other. Hence, the idea is to look for rules having a lift much greater than 1. So, let’s see how much rules has lift greater than 1, equal 1 and less than one:
rules_df.lift.apply(lambda x: 'Greater Than One' if x > 1 else 'One' \
if x == 0 else 'Less Than One').value_counts()
lift
Greater Than One 25247
Name: count, dtype: int64
So all rules are valid.
pd.options.display.max_colwidth = dw
Conclusion#
As we have seen, the generation of rules is a very simple process, but very computationally expensive, since it grows exponentially with the increase of the set of items.
Overall, we seek the proper balance between support and confidence leading to a reasonable number of strong rules.
In the other hand, if the goal is to identify rare but with high confidence patterns, we should proceed as previously, by establishing a low level of support and a higher level of confidence, which leads to a large number of rules.
With this in mind, the rules with low support and high confidence would then be our target for further study and than outlining of strategies to raise cross selling.